New York City Taxi Fare Prediction

This notebook uses the NYC Taxi Fare dataset from the Kaggle competition. It visualizes the pickup and drop-off locations and builds a model that predicts the fare of a trip from its pickup and drop-off locations.

I will not use all the data for the visualizations: the dataset is too large, and the Jupyter notebook keeps crashing when a large number of rows is loaded. So we will use only 50,000 rows for the visualizations, and then try 2 million rows for model building to get a more accurate model.
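To illustrate the row limit described above, here is a minimal sketch of loading only the first rows of a CSV with compact numeric dtypes. The `load_rows` helper, the `SMALL_DTYPES` mapping, and the in-memory CSV text are assumptions made for this example (the rows follow the dataset's column layout); the notebook itself only passes `nrows` to `pd.read_csv`.

```python
import io
import pandas as pd

# In-memory stand-in for train_nyctaxi.csv so the sketch runs without the real file.
CSV_TEXT = """key,fare_amount,pickup_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count
2009-06-15 17:26:21.0000001,4.5,2009-06-15 17:26:21 UTC,-73.844311,40.721319,-73.841610,40.712278,1
2010-01-05 16:52:16.0000002,16.9,2010-01-05 16:52:16 UTC,-74.016048,40.711303,-73.979268,40.782004,1
2011-08-18 00:35:00.00000049,5.7,2011-08-18 00:35:00 UTC,-73.982738,40.761270,-73.991242,40.750562,2
2012-04-21 04:30:42.0000001,7.7,2012-04-21 04:30:42 UTC,-73.987130,40.733143,-73.991567,40.758092,1
"""

# Assumed dtypes: smaller numeric types reduce memory use at the cost of some precision.
SMALL_DTYPES = {
    "fare_amount": "float32",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "passenger_count": "uint8",
}

def load_rows(source, nrows):
    """Read only the first `nrows` rows, using compact numeric dtypes."""
    return pd.read_csv(source, nrows=nrows, dtype=SMALL_DTYPES)

sample = load_rows(io.StringIO(CSV_TEXT), nrows=3)
print(sample.shape)
print(sample.dtypes)
```

With the real file, the same helper could be called as `load_rows("train_nyctaxi.csv", nrows=50_000)`. Whether `float32` coordinates are precise enough depends on the analysis, so treat the dtype choice as a starting point.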

In [18]:
# Importing the basic libraries first: NumPy, pandas, SciPy and seaborn
import warnings
warnings.simplefilter("ignore")
import numpy as np
import pandas as pd
from scipy.special import boxcox
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Plotly
# Plotly produces interactive graphs, so it is imported as well
import plotly
# Note: plotly.plotly moved to the separate chart_studio package in plotly 4
import plotly.plotly as py
import plotly.offline as offline
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import cufflinks as cf
from plotly.graph_objs import Scatter, Figure, Layout
cf.set_config_file(offline=True)
In [42]:
# Importing both train and test 
train = pd.read_csv("train_nyctaxi.csv", nrows = 50_000)
test = pd.read_csv("test_nyctrips.csv",nrows = 50_000)
print(">>  Data Loaded")
>>  Data Loaded
In [43]:
#Checking train head
train.head()
Out[43]:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2009-06-15 17:26:21.0000001 4.5 2009-06-15 17:26:21 UTC -73.844311 40.721319 -73.841610 40.712278 1
1 2010-01-05 16:52:16.0000002 16.9 2010-01-05 16:52:16 UTC -74.016048 40.711303 -73.979268 40.782004 1
2 2011-08-18 00:35:00.00000049 5.7 2011-08-18 00:35:00 UTC -73.982738 40.761270 -73.991242 40.750562 2
3 2012-04-21 04:30:42.0000001 7.7 2012-04-21 04:30:42 UTC -73.987130 40.733143 -73.991567 40.758092 1
4 2010-03-09 07:51:00.000000135 5.3 2010-03-09 07:51:00 UTC -73.968095 40.768008 -73.956655 40.783762 1
In [44]:
# Checking test head
test.head()
Out[44]:
key pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count
0 2015-01-27 13:08:24.0000002 2015-01-27 13:08:24 UTC -73.973320 40.763805 -73.981430 40.743835 1
1 2015-01-27 13:08:24.0000003 2015-01-27 13:08:24 UTC -73.986862 40.719383 -73.998886 40.739201 1
2 2011-10-08 11:53:44.0000002 2011-10-08 11:53:44 UTC -73.982524 40.751260 -73.979654 40.746139 1
3 2012-12-01 21:12:12.0000002 2012-12-01 21:12:12 UTC -73.981160 40.767807 -73.990448 40.751635 1
4 2012-12-01 21:12:12.0000003 2012-12-01 21:12:12 UTC -73.966046 40.789775 -73.988565 40.744427 1
In [45]:
# pickup_datetime is stored as object but should be datetime; the same applies to key
print(train.dtypes)
key                   object
fare_amount          float64
pickup_datetime       object
pickup_longitude     float64
pickup_latitude      float64
dropoff_longitude    float64
dropoff_latitude     float64
passenger_count        int64
dtype: object
In [23]:
# Converting key and pickup datetime to datetime format
train['key'] = pd.to_datetime(train['key'])
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'])
In [24]:
# Checking the data types again to confirm that key and pickup_datetime are converted
print(train.dtypes)
key                  datetime64[ns]
fare_amount                 float64
pickup_datetime      datetime64[ns]
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
passenger_count               int64
dtype: object
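The conversion above lets pandas infer the timestamp layout. As a hedged alternative, the sketch below passes an explicit `format` string that matches the `pickup_datetime` values shown in `train.head()`, including the literal `UTC` suffix. The `raw` series is a small stand-in built from those values, and the result is a timezone-naive datetime column.

```python
import pandas as pd

# Two pickup_datetime strings copied from the train.head() output.
raw = pd.Series(["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"])

# The trailing "UTC" is matched as literal text, so the parsed values carry no timezone.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC")

print(parsed)
print(parsed.dt.hour.tolist())
```

Giving the format explicitly avoids per-value format guessing, which may help when parsing millions of rows; I have not measured the difference on this dataset.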
In [25]:
# Checking for null values in the loaded rows
print("Number of Missing values in train: ", train.isnull().sum().sum())
print("Number of Missing values in test: ", test.isnull().sum().sum())
Number of Missing values in train:  0
Number of Missing values in test:  0
In [26]:
# Printing shape of dataset
print("Train shape {}".format(train.shape))
print("Test shape {}".format(test.shape))
Train shape (30000, 8)
Test shape (9914, 7)
In [27]:
# Plotting a histogram of fare_amount
# The graph shows that most fares fall between $0 and $50, although some fares fall outside that range
target = train.fare_amount
data = [go.Histogram(x=target)]
layout = go.Layout(title = "Fare Amount Histogram")
fig = go.Figure(data=data, layout=layout)
iplot(fig)
In [28]:
# Showing our dataset is from 2009 to 2015
print(f">> Data Available since {train.key.min()}")
print(f">> Data Available upto {train.key.max()}")
>> Data Available since 2009-01-01 01:31:49.000000300
>> Data Available upto 2015-06-30 22:42:39.000000140

Interactive Plots using Plotly and Mapbox

In [29]:
data = [go.Scattermapbox(lat= train['pickup_latitude'] ,lon= train['pickup_longitude'],customdata = train['key'],mode='markers',
            marker=dict(size= 4,color = 'gold',opacity = .8,),)]
layout = go.Layout(autosize=False,mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
                                bearing=10,
                                pitch=60,
                                zoom=13,
                                center= dict(lat=40.721319,lon=-73.987130),
                                style= "mapbox://styles/shaz13/cjiog1iqa1vkd2soeu5eocy4i"),
                                width=900,
                                height=600, title = "Pickup Locations in New York")
In [30]:
fig = dict(data=data, layout=layout)
iplot(fig)
In [31]:
data = [go.Scattermapbox(
            lat= train['dropoff_latitude'] ,
            lon= train['dropoff_longitude'],
            customdata = train['key'],
            mode='markers',
            marker=dict(
                size= 4,
                color = 'red',
                opacity = .8,
            ),
          )]
layout = go.Layout(autosize=False,
                   mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
                                bearing=10,
                                pitch=60,
                                zoom=13,
                                center= dict(
                                         lat=40.721319,
                                         lon=-73.987130),
                                style= "mapbox://styles/shaz13/cjk4wlc1s02bm2smsqd7qtjhs"),
                                width=900,
                                height=600, title = "Drop-off Locations in New York")
fig = dict(data=data, layout=layout)
iplot(fig)
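The map cells in this notebook repeat the same trace and layout settings. As a sketch of how that could be factored out, the helpers below build plain dictionaries, which Plotly accepts in place of `go.Scattermapbox` and `go.Layout` objects. The names `make_trace` and `make_layout` and the `MAPBOX_TOKEN` environment variable are hypothetical and not part of the original notebook; reading the token from the environment is an assumption made to keep it out of the code.

```python
import os

# Hypothetical environment variable holding the Mapbox access token.
MAPBOX_TOKEN = os.environ.get("MAPBOX_TOKEN", "")

def make_trace(lat, lon, color, name=None, size=4):
    """Return a scattermapbox trace as a plain dict of marker points."""
    trace = dict(
        type="scattermapbox",
        lat=list(lat),
        lon=list(lon),
        mode="markers",
        marker=dict(size=size, color=color, opacity=0.8),
    )
    if name is not None:
        trace["name"] = name
    return trace

def make_layout(title, style, token=MAPBOX_TOKEN):
    """Return the layout shared by the maps: same camera, size and center."""
    return dict(
        autosize=False,
        width=900,
        height=600,
        title=title,
        mapbox=dict(
            accesstoken=token,
            bearing=10,
            pitch=60,
            zoom=13,
            center=dict(lat=40.721319, lon=-73.987130),
            style=style,
        ),
    )

# One drop-off point from the first row of train.head(), used as sample data.
fig = dict(
    data=[make_trace([40.712278], [-73.841610], color="red")],
    layout=make_layout(
        "Drop-off Locations in New York",
        "mapbox://styles/shaz13/cjk4wlc1s02bm2smsqd7qtjhs",
    ),
)
print(fig["layout"]["mapbox"]["zoom"])
```

In the notebook, `iplot(fig)` would render this dictionary the same way as the figures built above, with `train['dropoff_latitude']` and `train['dropoff_longitude']` passed in place of the single sample point.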

Next, we are going to define different scenarios for analysis, such as where the majority of traffic is located on business days and on weekends.
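The business-day and weekend scenarios rely on the weekday index, where Monday is 0 and Sunday is 6, so values below 5 are business days. A small sketch with the standard library shows the convention on two pickup times from the data; the `is_business_day` helper is a name made up for this example.

```python
from datetime import datetime

def is_business_day(ts):
    """True for Monday (0) through Friday (4), False for Saturday (5) and Sunday (6)."""
    return ts.weekday() < 5

# Two pickup times taken from the train.head() output.
pickups = [
    datetime(2009, 6, 15, 17, 26, 21),  # a Monday
    datetime(2012, 4, 21, 4, 30, 42),   # a Saturday
]

for ts in pickups:
    print(ts, ts.weekday(), is_business_day(ts))
```

The pandas accessor `dt.weekday` follows the same Monday-is-0 numbering, which is why the filters in the following cells compare against 5.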

In [32]:
train['pickup_datetime_month'] = train['pickup_datetime'].dt.month
train['pickup_datetime_year'] = train['pickup_datetime'].dt.year
# Series.dt.weekday_name was removed in pandas 1.0; dt.day_name() returns the same day names
train['pickup_datetime_day_of_week_name'] = train['pickup_datetime'].dt.day_name()
train['pickup_datetime_day_of_week'] = train['pickup_datetime'].dt.weekday
train['pickup_datetime_day_of_hour'] = train['pickup_datetime'].dt.hour
In [33]:
business_train = train[train['pickup_datetime_day_of_week'] < 5 ]
business_train.head(5)
Out[33]:
key fare_amount pickup_datetime pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude passenger_count pickup_datetime_month pickup_datetime_year pickup_datetime_day_of_week_name pickup_datetime_day_of_week pickup_datetime_day_of_hour
0 2009-06-15 17:26:21.000000100 4.5 2009-06-15 17:26:21 -73.844311 40.721319 -73.841610 40.712278 1 6 2009 Monday 0 17
1 2010-01-05 16:52:16.000000200 16.9 2010-01-05 16:52:16 -74.016048 40.711303 -73.979268 40.782004 1 1 2010 Tuesday 1 16
2 2011-08-18 00:35:00.000000490 5.7 2011-08-18 00:35:00 -73.982738 40.761270 -73.991242 40.750562 2 8 2011 Thursday 3 0
4 2010-03-09 07:51:00.000000135 5.3 2010-03-09 07:51:00 -73.968095 40.768008 -73.956655 40.783762 1 3 2010 Tuesday 1 7
5 2011-01-06 09:50:45.000000200 12.1 2011-01-06 09:50:45 -74.000964 40.731630 -73.972892 40.758233 1 1 2011 Thursday 3 9
In [34]:
early_business_hours = business_train[business_train['pickup_datetime_day_of_hour'] < 10]
late_business_hours = business_train[business_train['pickup_datetime_day_of_hour'] > 6]
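One thing to keep in mind when reading the early/late maps: the two hour filters (`< 10` and `> 6`) are not disjoint, so trips picked up at some hours land in both groups. The sketch below lists the shared hours, and then shows a single-cutoff split as one possible alternative. The `CUTOFF_HOUR` value of 12 is purely hypothetical and not what the notebook uses.

```python
# Hours selected by the two filters used in the notebook.
early_hours = {h for h in range(24) if h < 10}
late_hours = {h for h in range(24) if h > 6}

# Hours that satisfy both conditions and therefore appear in both groups.
overlap = sorted(early_hours & late_hours)
print(overlap)  # [7, 8, 9]

# Hypothetical alternative: one cutoff gives two groups with no shared hours.
CUTOFF_HOUR = 12
early_alt = {h for h in range(24) if h < CUTOFF_HOUR}
late_alt = {h for h in range(24) if h >= CUTOFF_HOUR}
print(sorted(early_alt & late_alt))  # []
```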
In [36]:
data = [go.Scattermapbox(
            lat= early_business_hours['dropoff_latitude'] ,
            lon= early_business_hours['dropoff_longitude'],
            customdata = early_business_hours['key'],
            mode='markers',
            marker=dict(
                size= 5,
                color = 'red',
                opacity = .8),
            name ='early_business_hours'
          ),
        go.Scattermapbox(
            lat= late_business_hours['dropoff_latitude'] ,
            lon= late_business_hours['dropoff_longitude'],
            customdata = late_business_hours['key'],
            mode='markers',
            marker=dict(
                size= 5,
                color = 'cyan',
                opacity = .8),
            name ='late_business_hours'
          )]
layout = go.Layout(autosize=False,
                   mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
                                bearing=10,
                                pitch=60,
                                zoom=13,
                                center= dict(
                                         lat=40.721319,
                                         lon=-73.987130),
                                style= "mapbox://styles/shaz13/cjiog1iqa1vkd2soeu5eocy4i"),
                    width=900,
                    height=600, title = "Early vs. Late Business Days Drop-off Locations")
fig = dict(data=data, layout=layout)
iplot(fig)
In [37]:
weekend_train  = train[train['pickup_datetime_day_of_week'] >= 5 ]
early_weekend_hours = weekend_train[weekend_train['pickup_datetime_day_of_hour'] < 10]
late_weekend_hours = weekend_train[weekend_train['pickup_datetime_day_of_hour'] > 6]
In [38]:
data = [go.Scattermapbox(
            lat= early_weekend_hours['dropoff_latitude'] ,
            lon= early_weekend_hours['dropoff_longitude'],
            customdata = early_weekend_hours['key'],
            mode='markers',
            marker=dict(
                size= 5,
                color = 'violet',
                opacity = .8),
            name ='early_weekend_hours'
          ),
        go.Scattermapbox(
            lat= late_weekend_hours['dropoff_latitude'] ,
            lon= late_weekend_hours['dropoff_longitude'],
            customdata = late_weekend_hours['key'],
            mode='markers',
            marker=dict(
                size= 5,
                color = 'orange',
                opacity = .8),
            name ='late_weekend_hours'
          )]
layout = go.Layout(autosize=False,
                   mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
                                bearing=10,
                                pitch=60,
                                zoom=13,
                                center= dict(
                                         lat=40.721319,
                                         lon=-73.987130),
                                style= "mapbox://styles/shaz13/cjiog1iqa1vkd2soeu5eocy4i"),
                    width=900,
                    height=600, title = "Early vs. Late Weekend Days Drop-off Locations")
fig = dict(data=data, layout=layout)
iplot(fig)
In [39]:
# High fares: more than three standard deviations above the mean fare
high_fares = train[train['fare_amount'] > train.fare_amount.mean() + 3* train.fare_amount.std()]
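The high-fare filter keeps rows whose fare exceeds the mean plus three standard deviations. Here is a minimal sketch of the same rule on a small list, using the standard library. The `fares` values are hypothetical: the first five repeat the fares from `train.head()`, and the 200.0 entry is made up to act as an outlier. `statistics.stdev` computes the sample standard deviation, which is also the default for the pandas `.std()` call.

```python
from statistics import mean, stdev

# Hypothetical fares: typical values repeated, plus one made-up extreme fare.
fares = [4.5, 16.9, 5.7, 7.7, 5.3] * 4 + [200.0]

# Same rule as the notebook: mean + 3 * sample standard deviation.
threshold = mean(fares) + 3 * stdev(fares)

high = [f for f in fares if f > threshold]
print(round(threshold, 1))
print(high)  # [200.0]
```

Note that extreme values inflate both the mean and the standard deviation, so on very small samples this rule may flag nothing at all.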
In [40]:
high_fares.head()

data = [go.Scattermapbox(
            lat= high_fares['pickup_latitude'] ,
            lon= high_fares['pickup_longitude'],
            customdata = high_fares['key'],
            mode='markers',
            marker=dict(
                size= 8,
                color = 'violet',
                opacity = .8),
            name ='high_fares_pick_up'
          ),
        go.Scattermapbox(
            lat= high_fares['dropoff_latitude'] ,
            lon= high_fares['dropoff_longitude'],
            customdata = high_fares['key'],
            mode='markers',
            marker=dict(
                size= 8,
                color = 'gold',
                opacity = .8),
            name ='high_fares_drop_off'
          )]
layout = go.Layout(autosize=False,
                   mapbox= dict(accesstoken="pk.eyJ1Ijoic2hhejEzIiwiYSI6ImNqYXA3NjhmeDR4d3Iyd2w5M2phM3E2djQifQ.yyxsAzT94VGYYEEOhxy87w",
                                bearing=10,
                                pitch=60,
                                zoom=13,
                                center= dict(
                                         lat=40.721319,
                                         lon=-73.987130),
                                style= "mapbox://styles/shaz13/cjk4wlc1s02bm2smsqd7qtjhs"),
                    width=900,
                    height=600, title = "High Fare Locations")
fig = dict(data=data, layout=layout)
iplot(fig)